The Scenario: For this project, we take on the role of a junior data analyst working in the marketing analytics team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Below, we carry out this task and follow the key steps of the data analysis process: ask, prepare, process, analyse, share, and act.
First, we clarify the marketing department’s guiding question: How do annual members and casual riders use Cyclistic bikes differently?
The key business task is therefore to uncover the key differences between annual member and casual rider bike use.
By the end we hope to be able to present three main differences between member and casual bike use.
The dataset we are to use is located here and was made available by Motivate International Inc. under this license. It consists of data from the Chicago-based bicycle sharing service “Divvy”, which, for the purpose of this project, we consider to be reliable, original, comprehensive, and current data for our fictional company “Cyclistic”.
The dataset includes data from as far back as 2013. Data is packaged as zip-files in monthly increments, but data from before 2020-04 is quarterly or bi-annual. For this project, we use the monthly data ranging from 2020-04 to 2021-07.
To begin, we read the downloaded monthly data into RStudio, bind the months together, and examine the raw data using the skimr package.
library(tidyverse) # read_csv, dplyr verbs, pipes
library(here)      # project-relative file paths
library(skimr)     # skim_tee

options(width = 110)

# List the monthly CSV files and bind them into a single tibble
csv_list <- dir(here("Data"), pattern = ".csv")
cycle_data <- tibble()
for (i in seq_along(csv_list)) {
  df <- read_csv(here("Data", csv_list[i]), show_col_types = FALSE)
  cycle_data <- rbind(cycle_data, df)
}
cycle_data %>% skim_tee()
## ── Data Summary ────────────────────────
## Values
## Name data
## Number of rows 5910616
## Number of columns 13
## _______________________
## Column type frequency:
## character 7
## numeric 4
## POSIXct 2
## ________________________
## Group variables None
##
## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 ride_id 0 1 16 16 0 5910407 0
## 2 rideable_type 0 1 11 13 0 3 0
## 3 start_station_name 369331 0.938 10 53 0 749 0
## 4 start_station_id 369956 0.937 1 36 0 1287 0
## 5 end_station_name 409155 0.931 10 53 0 746 0
## 6 end_station_id 409616 0.931 1 36 0 1286 0
## 7 member_casual 0 1 6 6 0 2 0
##
## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 start_lat 0 1 41.9 0.0437 41.6 41.9 41.9 41.9 42.1 ▁▁▇▇▁
## 2 start_lng 0 1 -87.6 0.0261 -87.9 -87.7 -87.6 -87.6 -87.5 ▁▁▂▇▁
## 3 end_lat 6905 0.999 41.9 0.0438 41.5 41.9 41.9 41.9 42.2 ▁▁▇▇▁
## 4 end_lng 6905 0.999 -87.6 0.0263 -88.1 -87.7 -87.6 -87.6 -87.4 ▁▁▁▇▁
##
## ── Variable type: POSIXct ────────────────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max median n_unique
## 1 started_at 0 1 2020-04-01 00:00:30 2021-07-31 23:59:58 2020-11-25 11:10:06 5035619
## 2 ended_at 0 1 2020-04-01 00:10:45 2021-08-12 17:45:41 2020-11-25 10:46:32 5021291
Based on this initial read-in, we can see that there are several data validity issues:
First, the number of rows (5910616) does not match the number of unique ride_ids (5910407). This indicates that there may be duplicate entries in the dataset.
Second, six features have missing values, to varying extents. Each of these features relates to the starting or ending location of a ride.
Third, station names and station ids have differing numbers of unique values. Exploring further, we find that some stations have multiple ids, most likely because a number of stations were renamed and re-identified between 2020-11-30 and 2020-12-01.
## First and last appearance of each (name, id) pair as a starting station
df1 <- cycle_data %>%
  select(start_station_name, start_station_id, started_at) %>%
  group_by(start_station_name, start_station_id) %>%
  summarise(s_start_date = min(started_at),
            s_end_date = max(started_at),
            .groups = 'drop') %>%
  as.data.frame()
## First and last appearance of each (name, id) pair as an ending station
df2 <- cycle_data %>%
  select(end_station_name, end_station_id, ended_at) %>%
  group_by(end_station_name, end_station_id) %>%
  summarise(e_start_date = min(ended_at),
            e_end_date = max(ended_at),
            .groups = 'drop') %>%
  as.data.frame()
full_df <- full_join(df1, df2, by = c("start_station_name" = "end_station_name",
                                      "start_station_id" = "end_station_id"))
full_df
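The same idea can be expressed more directly by counting distinct ids per station name. A minimal sketch on a toy stand-in for full_df (the names and ids below are invented for illustration):

```r
library(dplyr)

# Toy stand-in for full_df: station "A" was renamed in place, so it
# carries two ids; station "B" kept a single id
full_df_toy <- data.frame(
  start_station_name = c("A", "A", "B"),
  start_station_id   = c("13001", "TA1305000001", "13002")
)

# Stations that map to more than one id, i.e. those affected by the renaming
renamed <- full_df_toy %>%
  group_by(start_station_name) %>%
  filter(n_distinct(start_station_id) > 1) %>%
  ungroup()
```

Run against the real full_df, this filter lists exactly the stations touched by the re-identification.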
The data can be used to answer several key questions about how members and casual riders differ; however, it also has limitations.
Due to privacy restrictions, the data does not include member, price, or payment information.
This means it is impossible to identify repeat users, analyse user behaviour by variables such as age or gender, or assess the impact of special offers or price changes.
For processing, we use R and RStudio as our tools of choice. R is a programming language that is particularly well suited to statistical analysis and works well with large datasets. RStudio is an integrated development environment for R that also allows the user to generate reports such as this one.
We predominantly use the tidyverse collection of packages for this stage.
Before cleaning up this data, it is useful to add some new variables:
The first six new variables can be created with base R, with the exception of cy_iso_week, which uses the isoweek() function from the lubridate package:
library(lubridate) # isoweek()

## Factor levels for month and day of week
month_levels <- 1:12
day_of_week_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

## Creating new variables
cycle_data2 <- cycle_data %>%
  mutate(
    cy_year = started_at %>% format("%Y") %>% as.numeric(),
    cy_month = started_at %>% format("%m") %>% as.numeric() %>% factor(levels = month_levels),
    cy_day = started_at %>% format("%d") %>% as.numeric(),
    cy_day_of_week = started_at %>% weekdays() %>% factor(levels = day_of_week_levels),
    cy_iso_week = started_at %>% isoweek() %>% as.numeric(),
    cy_time_m = difftime(ended_at, started_at, units = "mins")
  )
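One small point about cy_time_m: difftime() returns a difftime object that carries its units, which as.numeric() strips if plain numbers are preferred downstream. A minimal illustration with hypothetical timestamps (not rows from the dataset):

```r
library(tibble)
library(dplyr)

# Hypothetical timestamps to illustrate difftime in minutes
demo <- tibble(
  started_at = as.POSIXct(c("2020-04-01 08:00:00", "2020-04-01 09:00:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2020-04-01 08:25:30", "2020-04-01 09:05:00"), tz = "UTC")
) %>%
  mutate(
    time_m     = difftime(ended_at, started_at, units = "mins"),
    time_m_num = as.numeric(time_m)  # 25.5 and 5
  )
```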
The variable related to distance requires the package ‘geosphere’.
According to the R documentation, geosphere’s function distHaversine calculates the ‘shortest distance between two points (i.e., the ’great-circle-distance’ or ‘as the crow flies’)…assumes a spherical earth, [and] ignor[es] ellipsoidal effects.’
We will use it here to estimate the rough distance between the start and end points per ride:
library(geosphere) # distHaversine()

cycle_data2 <- cycle_data2 %>%
  mutate(
    cy_distance_m = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat))
  )
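As a sanity check on the new variable, distHaversine() can be compared against a hand-rolled haversine formula. The coordinates below are illustrative points near downtown Chicago, not rows from the dataset, and 6378137 m is geosphere's default earth radius:

```r
library(geosphere)

# Manual haversine implementation for comparison
manual_haversine <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(sqrt(a))
}

# Illustrative coordinate pair (lng, lat) near downtown Chicago
d_pkg    <- distHaversine(c(-87.63, 41.88), c(-87.62, 41.89))
d_manual <- manual_haversine(-87.63, 41.88, -87.62, 41.89)
# The two values agree to well under a metre
```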
With these new variables added, it is now time to clean the dataset.
To assess duplicates, we use the get_dupes() function from the janitor package.
The duplicated ride_id values are most likely due to the change in station ids mentioned earlier. For every duplicated ride_id, one row carries the older station id and one the newer. Each of the newer-id rows has a starting time that occurs after its ending time:
library(janitor) # get_dupes()

cycle_data2 %>%
  get_dupes(ride_id) %>%
  select(ride_id, started_at, ended_at, start_station_name, start_station_id,
         end_station_name, end_station_id, cy_time_m, cy_distance_m)
To remove these duplicates, we drop all rows where cy_time_m is negative:
cycle_data2 <- cycle_data2 %>% filter(cy_time_m >= 0)

# Checking that the duplicates are gone
cycle_data2 %>% summarise(n_unique = n_distinct(ride_id),
                          n_rows = n())
While on the topic of time, let’s take another look at the cy_time_m feature:
cycle_data2 %>%
  select(cy_time_m) %>%
  summarise(min_time = min(cy_time_m),
            av_time = mean(cy_time_m),
            med_time = median(cy_time_m),
            max_time = max(cy_time_m),
            sd = sd(cy_time_m))
cycle_data2 %>% ggplot() +
  geom_boxplot(aes(x = cy_time_m)) +
  labs(title = 'Length of Cycle Journeys (Mins)', x = 'Journey Length (Mins)')
There’s clearly quite a bit of variation here. It’s doubtful that someone would hire a bicycle for over 20000 minutes, so what’s going on?
At a resolution of 24 hours, the spread is outlined in the table below:
cycle_data2 %>%
  select(cy_time_m) %>%
  mutate(class_days = case_when(cy_time_m < 0 ~ "<0",
                                cy_time_m >= 0 & cy_time_m <= 1440 ~ "0-1d",
                                cy_time_m > 1440 & cy_time_m <= 4320 ~ "1-3d",
                                cy_time_m > 4320 ~ "3+d")) %>%
  group_by(class_days) %>%
  summarise(number_trips = n(),
            percent_of_total_trips = n() / nrow(cycle_data2) * 100)
After looking through the data, it's clear that a small number of these lengthy rides are due to bicycles being checked in at the main testing facility. These can be removed. Since Cyclistic offers the possibility of purchasing a day pass, it is also possible that a minority of riders choose to rent bicycles for longer periods.
Since these amount to a very small proportion of the wider dataset, we leave them in here.
## Remove bike checks at the testing/maintenance stations
## (note: because NA != "x" evaluates to NA, these conditions also drop
## rows with missing station names)
cycle_data2 <- cycle_data2 %>% filter(
  end_station_name != "HUBBARD ST BIKE CHECKING (LBS-WH-TEST)" &
  start_station_name != "WATSON TESTING - DIVVY")
A brief look reveals that many rides cover distances of less than 100m:
cycle_data2 %>%
  mutate(class_distance = case_when(cy_distance_m <= 0 ~ "<=0m",
                                    cy_distance_m > 0 & cy_distance_m < 10 ~ "0-10m",
                                    cy_distance_m >= 10 & cy_distance_m < 100 ~ "10-100m",
                                    cy_distance_m >= 100 ~ "100+m")) %>%
  group_by(class_distance) %>%
  summarise(n = n(),
            percent_of_total_trips = n() / nrow(cycle_data2) * 100)
When checking the distance values, another interesting thing becomes apparent: close to 10% of all rides (532948 rows) are below 100m. Of these, approximately 80% travelled 0 metres. When we examine the starting and ending docking stations below, we find out why:
cycle_data2 %>%
  filter(cy_distance_m < 1) %>%
  select(ride_id, started_at, ended_at, start_station_name,
         end_station_name, cy_time_m, cy_distance_m) %>%
  arrange(desc(cy_time_m))
The starting and ending docking stations are the same, exposing an unfortunate limitation of the dataset. As we only have starting and ending stations to work with, calculating the distance between them cannot estimate the distance travelled when a user rides a circular loop and returns the bicycle to the station they picked it up from.
This restricts the conclusions we can draw from summary distance statistics: the true distance travelled is likely to be considerably higher than the figures below suggest, since round trips register as 0m:
cycle_data2 %>%
  select(cy_distance_m) %>%
  rename(original_values_distance_m = cy_distance_m) %>%
  summary()
## original_values_distance_m
## Min. : 0.0
## 1st Qu.: 879.3
## Median : 1690.5
## Mean : 2205.4
## 3rd Qu.: 2992.8
## Max. :48425.0
cycle_data2 %>%
  select(cy_distance_m) %>%
  filter(cy_distance_m >= 1) %>%
  rename(above1m_distance_m = cy_distance_m) %>%
  summary()
## above1m_distance_m
## Min. : 1
## 1st Qu.: 1091
## Median : 1869
## Mean : 2424
## 3rd Qu.: 3185
## Max. :48425
cycle_data2 %>%
  select(cy_distance_m) %>%
  filter(cy_distance_m >= 100) %>%
  rename(above100m_distance_m = cy_distance_m) %>%
  summary()
## above100m_distance_m
## Min. : 100
## 1st Qu.: 1118
## Median : 1891
## Mean : 2450
## 3rd Qu.: 3209
## Max. :48425
cycle_data2 %>%
  select(cy_distance_m) %>%
  filter(cy_distance_m >= 500) %>%
  rename(above500m_distance_m = cy_distance_m) %>%
  summary()
## above500m_distance_m
## Min. : 500
## 1st Qu.: 1222
## Median : 1979
## Mean : 2557
## 3rd Qu.: 3302
## Max. :48425
For the purpose of this analysis, however, it is useful to leave these rows in.
First, let's check how many NA values remain in the station id columns:
cycle_data2%>%select(start_station_id,end_station_id)%>%sapply(function(x) sum(is.na(x)))
## start_station_id end_station_id
## 523 404
It appears that the earlier measures reduced the number of NAs by quite a bit.
We now take a look at the rows with NA values
## Initial look at the rows with NA ids
cycle_data2 %>% filter(is.na(start_station_id) | is.na(end_station_id)) %>% arrange(started_at)

## Checking whether the end stations differ from the start stations
cycle_data2 %>% filter(is.na(start_station_id)) %>% select(start_station_name) %>% unique()
cycle_data2 %>% filter(is.na(end_station_id)) %>% select(end_station_name) %>% unique()

## They do not, so we save the names in a new data frame
station_NA_names <- cycle_data2 %>%
  filter(is.na(end_station_id)) %>%
  select(end_station_name) %>%
  rename(NAstation_names = end_station_name) %>%
  unique()

## Only two stations lack id values, and other rows do include ids for them.
## We use the data frame from the previous step to find each station's id
cycle_data2 %>%
  filter(start_station_name %in% station_NA_names$NAstation_names & !is.na(start_station_id)) %>%
  select(start_station_name, start_station_id) %>%
  distinct()
With these ids identified, we can now replace the NA values and continue:
## Using the id values to replace NA values
cycle_data2 <- cycle_data2 %>%
  mutate(
    start_station_id = if_else(
      start_station_name == "W Oakdale Ave & N Broadway" & is.na(start_station_id), "20252.0",
      if_else(start_station_name == "W Armitage Ave & N Sheffield Ave" & is.na(start_station_id),
              "20254.0", start_station_id)),
    end_station_id = if_else(
      end_station_name == "W Oakdale Ave & N Broadway" & is.na(end_station_id), "20252.0",
      if_else(end_station_name == "W Armitage Ave & N Sheffield Ave" & is.na(end_station_id),
              "20254.0", end_station_id)
    ))
##checking to see that NAs are removed correctly
cycle_data2%>%select(start_station_id,end_station_id)%>%sapply(function(x) sum(is.na(x)))
## start_station_id end_station_id
## 0 0
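The hard-coded if_else() calls work here because only two stations are affected. A more general pattern, sketched on a toy data frame under the assumption that every affected station name appears elsewhere with a valid id, is to build a name-to-id lookup and fill the gaps with coalesce() (rides_toy, id_lookup, and lookup_id are invented names):

```r
library(dplyr)

# Toy stand-in for the station columns: one row is missing its id,
# but the same station name appears elsewhere with a valid id
rides_toy <- data.frame(
  start_station_name = c("W Oakdale Ave & N Broadway",
                         "W Oakdale Ave & N Broadway",
                         "Some Other St & Ave"),
  start_station_id   = c("20252.0", NA, "13045"),
  stringsAsFactors   = FALSE
)

# Name -> id lookup built from the rows that do have ids
id_lookup <- rides_toy %>%
  filter(!is.na(start_station_id)) %>%
  distinct(start_station_name, .keep_all = TRUE) %>%
  rename(lookup_id = start_station_id)

# Fill the gaps; the same join-and-coalesce works for the end ids
rides_filled <- rides_toy %>%
  left_join(id_lookup, by = "start_station_name") %>%
  mutate(start_station_id = coalesce(start_station_id, lookup_id)) %>%
  select(-lookup_id)
```

This scales to any number of affected stations without hand-listing them.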
Here, the end goal is to be able to assess the popularity of each docking station, as defined by the number of times it has been used as a starting or ending station.
To do this, we separate the distinct starting and ending station ids used before 2020-12-01, the transition date, from those used after. We join these together and merge the id columns so that only the currently used ids remain. We then use left_join() to attach the updated id values to the dataset for both start_station_id and end_station_id, replacing the original columns.
##creating list of starting stations BEFORE transition date
station_list_a_start<-cycle_data2 %>% filter(
started_at < date("2020-12-01")
)%>%
distinct(
start_station_name,
start_station_id
)%>%
rename(
station_name_a = start_station_name,
station_id_a = start_station_id
)%>%
arrange(station_name_a)
##creating list of ending stations BEFORE transition date
station_list_a_end<-cycle_data2 %>% filter(
ended_at < date("2020-12-01")
)%>%
distinct(
end_station_name,
end_station_id
)%>%
rename(
station_name_a = end_station_name,
station_id_a = end_station_id
)%>%
arrange(station_name_a)
##joining both lists together to create comprehensive list
station_list_a_full <- full_join(station_list_a_start, station_list_a_end)
## Joining, by = c("station_name_a", "station_id_a")
##creating list of starting stations AFTER transition date
station_list_b_start<-cycle_data2 %>% filter(
started_at >= date("2020-12-01")
)%>%
distinct(
start_station_name,
start_station_id
)%>%
rename(
station_name_b = start_station_name,
station_id_b = start_station_id
)%>%
arrange(station_name_b)
##creating list of ending stations AFTER transition date
station_list_b_end<-cycle_data2 %>% filter(
ended_at >= date("2020-12-01")
)%>%
distinct(
end_station_name,
end_station_id
)%>%
rename(
station_name_b = end_station_name,
station_id_b = end_station_id
)%>%
arrange(station_name_b)
##joining both lists together to create comprehensive list
station_list_b_full <- full_join(station_list_b_start, station_list_b_end)
## Joining, by = c("station_name_b", "station_id_b")
station_lookup<-station_list_a_full%>%
full_join(
.,station_list_b_full, by=c("station_name_a" = "station_name_b")
)%>%
mutate(
station_id_b = if_else(is.na(station_id_b),station_id_a,station_id_b),
)%>%
select(-station_id_a)%>%
rename(
station_name = station_name_a,
station_id = station_id_b
)%>%
distinct(
station_name, .keep_all = TRUE )%>%
arrange(
station_name
)
cycle_data3 <- cycle_data2%>%
select(-start_station_id, -end_station_id)%>%
left_join(.,station_lookup, by = c("start_station_name" = "station_name"))%>%
rename(start_station_id = station_id)%>%
left_join(.,station_lookup, by =c("end_station_name" = "station_name"))%>%
rename(end_station_id = station_id)
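The crux of the station_lookup step is the coalescing of new ids over old ones: a station keeps its post-transition id when one exists and falls back to its pre-transition id otherwise. A toy version of that merge, with invented names and ids:

```r
library(dplyr)

# Toy versions of the before/after station lists: "A" exists in both
# periods (with a new id), "B" only before, "C" only after
station_list_a_toy <- data.frame(
  station_name_a = c("A", "B"),
  station_id_a   = c("101", "102")   # pre-transition ids
)
station_list_b_toy <- data.frame(
  station_name_b = c("A", "C"),
  station_id_b   = c("201", "203")   # post-transition ids
)

# Same merge as in the text: prefer the new id, fall back to the old one
station_lookup_toy <- station_list_a_toy %>%
  full_join(station_list_b_toy, by = c("station_name_a" = "station_name_b")) %>%
  mutate(station_id_b = if_else(is.na(station_id_b), station_id_a, station_id_b)) %>%
  select(-station_id_a) %>%
  rename(station_name = station_name_a, station_id = station_id_b) %>%
  arrange(station_name)
```

Here "A" resolves to its new id, while "B" and "C" keep their only known ids.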
Both user groups use roughly the same proportion of bicycle types.
Members use slightly more classic bikes than casual users, who favour docked bikes.
Bicycle rental is more prevalent in the summer months.
Members are more likely to rent bikes, particularly in the winter months.
In 2021, there were more casual users than members over the summer months of June and July.
Both members and casual users follow the same general patterns of use throughout the week; however, the following generalisations may be made:
Week Days:
Members:
o From 5am to 10:30am (peaking between 7:30am and 8:30am) and from 4:30pm to 6pm, Monday to Friday, members make more trips. This mirrors the working day.
o Between 5:30am and 1pm on Saturday and Sunday, members are slightly more likely to make trips.
Casual users:
o Between 12pm and 4pm, Monday to Friday, casual users are more likely to use bicycles.
o From 6:30pm to ~4am, Monday to Friday, casual users are more likely to ride, particularly between 9pm and midnight.
Weekends:
o Both casual users and members follow the same pattern. However, members begin to use bicycles with more regularity at around 5:30am, whereas casual users begin at around 7am. Both groups peak at ~2pm and show slight spikes of usage between midnight and 1am.
Bicycle use is driven by members. In 2020, particularly over winter, and up until June 2021, members drove usage of Cyclistic bicycles. They favour docked bicycles, but are also more likely than casual users to choose classic bikes. They travel the same distances as casual users but use bicycles for roughly 10 minutes less per trip, indicating that they are more conditioned to bicycle travel.
Members' time-of-day patterns indicate that they use Cyclistic bicycles to travel to and from work.
Casual use primarily takes place in summer, with 2021 usage stronger than in 2020. Casual use in April 2021 rose by 475% compared to the previous year, and increases have generally been more favourable than for members, indicating higher brand visibility compared to 2020. Close to half of all casual riders choose docked bicycles. Casual users are less likely to use bicycles for travel to work but, like members, are much more likely to travel in the evening, particularly after 9pm.